Audio description from image by modal translation network
Authors
Abstract
Audio is the main way for visually impaired people to obtain information. In reality, visual data of all kinds is almost always available, but audio is absent in many cases. In order to help people better perceive the information around them, an image-to-audio-description (I2AD) task is proposed in this paper to generate audio descriptions from images. To complete this totally new task, a modal translation network (MT-Net) from the visual to the auditory sense is proposed. The MT-Net includes three progressive sub-networks: 1) feature learning, 2) cross-modal mapping, and 3) audio generation. First, the feature learning sub-network aims to learn semantic features from image and audio, including image feature learning and audio feature learning. Second, the cross-modal mapping sub-network transforms the image feature into a cross-modal representation with the same semantic concept as the audio feature. In this way, the correlation of inter-modal data is effectively mined, easing the heterogeneous gap between image and audio. Finally, the audio generation sub-network is designed to generate the audio waveform from the cross-modal representation. The generated waveform is interpolated to obtain the corresponding audio file according to the sample frequency. Being the first attempt to explore the I2AD task, large-scale datasets with plenty of manual audio descriptions are built. Experiments on the datasets verify the feasibility of generating intelligible audio directly from images and the effectiveness of the proposed method.
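The last step of the abstract, interpolating a generated waveform and saving it at a given sample frequency, can be sketched as follows. The abstract does not say which interpolation scheme or file format the authors use, so linear interpolation, 16-bit mono PCM WAV output, and the 16 kHz rate are all assumptions, and the function names `linear_resample` and `write_wav` are hypothetical.

```python
import io
import struct
import wave


def linear_resample(samples, target_len):
    """Linearly interpolate a list of floats to target_len points (assumed scheme)."""
    if target_len <= 0 or not samples:
        return []
    if len(samples) == 1 or target_len == 1:
        return [samples[0]] * target_len
    # Distance between output points, measured in input-sample units.
    step = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), len(samples) - 2)  # left neighbour index
        frac = pos - lo                       # weight of the right neighbour
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out


def write_wav(fileobj, samples, sample_rate):
    """Write floats in [-1, 1] as 16-bit mono PCM to a path or file-like object."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(fileobj, "wb") as wav:
        wav.setnchannels(1)       # mono (assumption)
        wav.setsampwidth(2)       # 2 bytes = 16-bit samples (assumption)
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


# Stand-in for a coarse network output; real MT-Net outputs are not shown in the source.
coarse = [0.0, 1.0, 0.0, -1.0, 0.0]
resampled = linear_resample(coarse, 9)

buffer = io.BytesIO()             # in-memory file; a filename string also works
write_wav(buffer, resampled, 16000)
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing a path such as `"description.wav"` to `write_wav` would produce a playable file instead.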
Similar resources
Grammatical adjustments in translation from English into Persian by Iranian students
To determine the relationship between certain English grammatical structures and the degree of difficulty these structures pose in translation, the following research questions and null hypotheses were formulated: 1. Do Iranian students have problems with grammatical adjustment when translating from English into Persian? 2. Is there a relationship between English grammatical forms and the degree of difficulty of these forms in translation from English into Persian? Hypothesis 1: Students ...
From Image Annotation to Image Description
In this paper, we address the problem of automatically generating a description of an image from its annotation. Previous approaches either use computer vision techniques to first determine the labels or exploit available descriptions of the training images to either transfer or compose a new description for the test image. However, none of them report results on the effect of incorrect label d...
Cross-modal Visual-audio Priming
This study assessed whether presenting visual-only stimuli prior to auditory stimuli facilitates the recognition of spoken words in noise. The results of the study indicate that this type of cross-modal priming does occur. Future directions for research in this domain are presented.
Processing Multi-modal Primitives from Image Sequences
In this paper, we describe a new kind of image representation in terms of local multi-modal Primitives. Our local Primitives can be characterized by three properties: (1) They represent different aspects of the image in terms of multiple visual modalities. (2) They are adaptable according to context. (3) They provide a condensed representation of local image structure. These three properties ma...
Journal
Journal title: Neurocomputing
Year: 2021
ISSN: 0925-2312, 1872-8286
DOI: https://doi.org/10.1016/j.neucom.2020.10.053